
Poster: Fizscy Date: Dec 27, 2011 8:29am
Forum: web Subject: Why does the wayback machine pay attention to robots.txt

Honestly, half of the internet is being missed because some honest do-gooder decided that the garbage that is robots.txt should be followed by this archival service. This needs to stop.

The wayback machine is exempt from copyright issues under fair use doctrine and due to its educational purpose.

Please stop ignoring websites because of ignorant, uninformed, or possessive webmasters.


Poster: Gameboy Genius (nitro2k01) Date: Jan 21, 2016 2:13pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

I will just have to concur with previous speakers that the situation is less than ideal. There are several ways this could be solved technically. One relatively easy way, if the owner's legitimate right to be excluded should be respected, is to create a list of domain hijackers and ignore the retroactive delisting if the domain is currently owned by a known hijacker. I would argue that this is currently the biggest hindrance to archive.org's stated goal of preserving web information.

This post was modified by Gameboy Genius (nitro2k01) on 2016-01-21 22:13:25


Poster: '=-/-=-/=#- Date: Jul 30, 2013 4:18pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Yes it's a sad situation, robots.txt is an abomination. But nothing short of policy change or some kind of hack will make available the uncensored data, which is not deleted, just rendered inaccessible, seemingly forever.


Poster: Hobbyboy Date: Nov 14, 2014 6:43am
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

I completely agree that the internet archive shouldn't follow robots.txt for several reasons.
When a hostmaster adds a robots.txt, it blocks the whole site on the internet archive from being viewed, including the archived versions, which ends up breaking references from other websites.
Also, it stops you from being able to find a copy of old software that isn't available to download anywhere anymore. For example, I was trying to look for a copy of Ubuntu Studio 8.04, which wasn't on the Ubuntu archive for some reason, but the internet archive had it mirrored. If the Ubuntu archive added a robots.txt, it would be unavailable to download anywhere. Robots.txt is basically putting history in a locked up room, and throwing away the key.


Poster: Infenwe Date: Jul 19, 2014 12:36pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

One thing that I find very puzzling about this policy is that robots.txt filtering appears to be *retroactive*, i.e. archive.org could have mirrored a site in 2004, and ten years later, in 2014, /robots.txt is altered to say "Disallow: /" and *POOF*, archive.org refuses to display this old data any longer, either because it has been purged or because the archive simply won't serve it.

That seems like a distinctly suboptimal design decision to me. One way that it seems likely to happen is this:

1) up to 2005-ish: site is held by its original owners and is doing reasonably well.
2) 2006-ish: site goes under. Domain taken over by squatter.
3) 2014-ish: Squatter's account gets suspended due to abuse. The people who put up the suspension notice also put in a /robots.txt to disallow crawling of /.

And voilà! Legitimate content that no one ever wanted to get purged from archive.org is suddenly gone.

Case in point: http://web.archive.org/web/20070103112847/http://www.infoceptor.com/


Poster: DKL3 Date: Jul 20, 2014 4:57am
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Gahhh! I really don't like it that a small <100 KB text file destroys pieces of internet history.

Anyway, I really REALLY REALLY despise robots.txt. It gets planted on a site, a lot of the time if the site shuts down, and information that was once accessible on that glorious Wayback Machine site is no longer accessible. Darn it, people.

P.S. the crawls do exist on the Wayback Machine site. When I search a URL that's been excluded (by request/robots.txt), the page posts me a hidden 403 Forbidden error.

How fair is that? They still have it archived, but we, the public, cannot gain access to it? This is an abomination! I even have proof of this from an experience with a site mentioned in the next paragraph.

And there is a site (http://web.archive.org/*/nintendoweb.com) that is suffering from robots.txt problems because some stupid, untrusted site links its robots text file to Nintendo Web's one. Sometimes there's a gateway through, but not always. This was not intended by the webmaster. He never had robots on his site until this stupid spam came along to the site, deleting its contents and controlling it.

And then there are these people who change their robots exclusion standards. What may have once been a few directories blocked from being archived could turn into the entire site, all at the webmaster's will! I once got to crawl Wal-Mart's old catalog pages on archive.org, but they later changed their robots policy to disallow crawling of the "catalog" directory.

And the other people have very good points here, especially the fact that robots are usually used by people who are selfish (in terms of not willing to correct a crawling mistake) or try to block one crawler but not another (like archive.org).

Careful, as the robots text file might strike at any moment. I suggest you save/archive your favorite old web pages on this machine before they get the "robots" move.

And if you think this is not enough, may I also say that the ArchiveTeam, an affiliate of archive.org and a team loved by thousands, also hates robots.txt? Link: http://archiveteam.org/index.php?title=Robots.txt

C'mon ladies and gentlemen, let us destroy this robots.txt policy! We can live without it. If Nintendo of America (http://nintendo.com/robots.txt) can do it, everyone can! Who's with me?!? We will start an Avaaz-based request! :D


Sign the petition here: https://secure.avaaz.org/en/petition/The_Internet_Archive_Include_Every_Site_on_the_Wayback_Machine_Regardless_of_Robotstxt

This post was modified by DKL3 on 2014-07-20 11:57:26


Poster: Thestral Date: Apr 25, 2014 12:50pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

This is really an abominable situation. Years ago a domain squatter stole a domain name right from under me during renewal and proceeded to ruin our site's good name by posting porn ads on their landing page. Now, because this domain thief is using robots.txt to prevent crawling, the only record of my site's existence online has been wiped out.

How in the world is a robots file, posted recently, stopping me from viewing my own old site's history from almost a decade ago just because it's on the same domain? This robots policy needs to be changed. Whose bright idea was it to assume that the current owner of a domain has the right to erase that domain's history from the Wayback Machine? A history that may include the IP of people who want inclusion? This is galling.

This post was modified by Thestral on 2014-04-25 19:44:37

This post was modified by Thestral on 2014-04-25 19:50:33


Poster: DKL3 Date: Apr 9, 2014 3:34pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

I will have to agree on this one. The robots text file is quite limiting. Is there an alternative way to find a webpage's old crawls?

There is a similar dilemma where a webmaster wants their site excluded from the Wayback Machine (e.g. Nintendo of Europe). Plain ridiculous stuff right there. The UK only had a temporary feud with archive.org, and some sites are still blocked because of it.

This life is clearly losing its edge.


Poster: carehart Date: Apr 19, 2014 5:52pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Yes, this really is quite tragic to learn (that if a webmaster puts in a robots.txt, it could--and likely unexpectedly to them--cause the entire archived history for that site here to become unavailable), as discussed at https://archive.org/about/faqs.php#14.

The tragedy is that it could be entirely unexpected by the site owner that this will be the result.

They may be doing it simply to block "all bots", naively. They may be intentionally blocking "alexa" without realizing that it really is for the benefit of archive.org. Or they (if asked) may want to block alexa/archive.org going forward but not necessarily lose all the history.

And as others have noted, it may be that someone other than the previous site owner has taken over the domain, and they may have no particular interest/motivation about preserving the previous content (but they may also have no specific desire to block access to it).

I agree with the previous poster who proposed that it seems the archive.org organizers should instead opt to keep archived content until someone specifically requests it be removed.

I appreciate that for some folks, the archive may seem just a cute way to look at really old views of web sites (and the "wayback machine" moniker doesn't really help from that perspective).

But for years I've helped people leverage it to find old files or content that was a) important in the past but b) now missing from the site in question--usually just because it was no longer current or important enough for the site owner to keep it around.

But it is so often still very important for folks actively seeking it. I know I turn to it about every week or two for something, across hundreds of sites the past few years, and more and more often I am heartbroken to find that content is now "blocked" for this reason.

To be clear, I'm just a user of archive, and a very long-time fan. I realize that there may be more to this than is obvious from that FAQ or from the replies in this thread (and a couple other forum threads here).

But with all due respect, if there is not some more substantial justification that we may be missing, this decision (of a simple robots.txt on a site basically removing all the archive.org history for that site) seems the very definition of draconian: "very severe or cruel".

This post was modified by carehart on 2014-04-20 00:52:43


Poster: carehart Date: Apr 19, 2014 6:04pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

I just found this page where site owners are told how they can intentionally exclude their site from the archive.org crawling:

http://archive.org/about/exclude.php

So in fact, this could be contributing to the problem. (I still think many of the blocks could be simply because people are naively blocking all spiders, or blocking Alexa but not realizing it's about archive.org).

What I mean is: why don't the archive.org and Alexa folks come up with *ANOTHER directive* that they tell site owners here to put in, which could for instance distinguish between whether they want their site crawled going forward and whether they want their previously archived content to remain available?

I just really suspect that at least for some who WOULD read this page, they may well opt for blocking crawling going forward without removal of past archived content.

To any who would say "but there's no robots.txt standard directive that would suit this", I would point out that the robots.txt "standard" is pretty wishy-washy. There are plenty of "standard" directives that some crawlers don't honor. And that means that there are plenty of directives people add to their robots.txt that are ignored by spiders.

More to the point, I mean that if the folks here/alexa came up with directives that were meaningful only to the alexa crawler and specific to archive.org, there would be no harm if folks added them. Because again, these would not be the only directives they may add which would not be meaningful to ALL spiders (that look at the file).

The robots.txt concept is more a "convention" than a "standard", as there is no official standards body. More at http://en.wikipedia.org/wiki/Robots_exclusion_standard. So I really think there'd be no harm in my proposal, "unconventional" though it may be.

Is there anyone following this thread who may be in a position of responsibility to help us know if that might ever even be considered or discussed?


Poster: archivefcc Date: Apr 2, 2015 11:36pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Yes. I'm trying to ask a webmaster to unblock the internet archive, but there is no archive.org link I can point him to that will explain how he can do so.

Why in the world does http://archive.org/about/exclude.php not ALSO explain how to ALLOW archiving by the internet archive while preserving blocks for other crawlers?

According to http://www.robotstxt.org/robotstxt.html, it would be something like:

User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
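
One way to sanity-check a rule set like this before pointing a webmaster at it is Python's standard urllib.robotparser module. The sketch below is only an illustration: example.com and the SomeOtherBot agent name are placeholders, and it shows how one standard parser reads these rules, not how archive.org's own crawler or playback actually interprets them.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post above: allow ia_archiver, block everyone else.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""


def allowed(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # example.com and SomeOtherBot are placeholders for illustration only.
    print("ia_archiver:", allowed("ia_archiver", "http://example.com/page.html"))
    print("SomeOtherBot:", allowed("SomeOtherBot", "http://example.com/page.html"))
```

An empty "Disallow:" line means "nothing is disallowed" for that agent, which is why the first group lets ia_archiver through while the catch-all group blocks everything else.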


Poster: DKL3 Date: Apr 20, 2014 11:25am
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

I kind of do agree with you about leaving this open for discussion. Archiving a site is not a big deal, to be honest with you. It essentially is just taking a website and modifying its URL to be compliant with this service.

That's all it simply is.


Poster: Detective John Carter of Mars Date: Dec 27, 2011 3:01pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

@http://www.archive.org/about/faqs.php#2
"The Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection."


Poster: PiRSquared Date: Sep 6, 2014 9:20pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

What if the domain is squatted/taken over by another person?


Poster: d0c5i5 Date: Jan 21, 2015 2:32pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

aarrrgggg...

Why hasn't this been fixed? I used to find so many things that I can't find because these domain pirates are buying up barely used/forgotten/lapsed domain names and often put in robots.txt (along with countless USELESS ads to nowhere)...

Look, I love collecting old hardware or resurrecting old hardware from countless places and doing stuff with it. As with so many Linux/GNU projects, there may be few or scarce references to how it was done, pieces of code, or even small downloads that are completely worthy of being preserved, but as the hardware ages (or the authors literally die), this data gets erased from history, and I'm often left with links to source code/downloads/whatever referenced in forums that point to what was free/open data (even LICENSED as distributable, if GNU/GPL applies, so I doubt the new owner trying to make a buck off all the people that could end up on the domain they snatched has any more claim than I do)....

Hmmm... If I were to name my kid "Disney", and Disney died/forgot to fill out a form, etc., would/could I ever wipe out all of the Disney movies from history?


Poster: d0c5i5 Date: Feb 21, 2015 1:26pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

I'm glad to see this discussion is ongoing. I'm going to write a script to replace my archive look-ups with that little trick, so I can access those records with a click if I come across them.

Regarding how this should be handled, imho, is that robots.txt should only be honored at crawl time. Period. (Esp if they didn't include the robots.txt back on the crawled date)
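
A minimal sketch of that crawl-time-only rule, assuming the archive kept a dated history of whether a domain's robots.txt blocked it. The function names and the history format are made up for illustration; nothing here reflects how the Wayback Machine is actually implemented.

```python
from datetime import date


def robots_blocked_at(capture_date, block_history):
    """Return True if robots.txt blocked the archive on `capture_date`.

    `block_history` is a hypothetical list of (start_date, blocked) pairs,
    sorted by start_date, each meaning "from start_date onward, robots.txt
    blocked (True) or allowed (False) the archive".
    """
    blocked = False  # assumption: no block before the first recorded change
    for start, state in block_history:
        if start <= capture_date:
            blocked = state
        else:
            break
    return blocked


def snapshot_viewable(capture_date, block_history):
    """Crawl-time-only policy: only the rule in force at capture time counts."""
    return not robots_blocked_at(capture_date, block_history)


if __name__ == "__main__":
    # A site mirrored in 2004 whose robots.txt first said "Disallow: /" in 2014.
    history = [(date(2014, 1, 1), True)]
    print(snapshot_viewable(date(2004, 6, 1), history))  # old capture stays visible
    print(snapshot_viewable(date(2015, 1, 1), history))  # capture after the block
```

Under this rule a later robots.txt change never reaches back in time: it can only affect captures made while it is in force.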

If someone wants to remove OLD data for a domain they now own AND they owned in the past, then they should do the leg work. Archive.org could offer a service where if you provide specific proof of ownership, possibly a legitimate claim for why it should be removed, and perhaps a fee to pay a trusted 3rd party to evaluate your request, then and only then, should they consider removing the records.

I just think about this, and fast forward 50 years, and the amount of both unintentional and intentional censorship that will happen, and it makes me sad. I know we are moving into the future, but I think archive.org is one of the shining examples of why the past matters, and it shouldn't be wiped away without a reason.

my 2c,
d0c


Poster: PiRSquared Date: Jan 21, 2015 2:51pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

I think it would make sense to show data from squatted domains, even if the current owner forbids it via robots.txt. Anyway, robots.txt was meant to prevent crawlers from visiting a site, but we're talking about displaying already-crawled data. Do you have a specific example site?


Poster: rin-q Date: Jan 24, 2015 6:23pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Stumbled upon this discussion while searching for an old encyclopedia of Japanese folklore monsters whose domain hasn't been renewed, and for a way to gain access to the older entries from before the website disappeared.

So the domain has been bought by a reseller, and since a robots.txt file has been added, none of the information that was available two years ago can be reached via the Wayback Machine.

So a good example website would be obakemono dot com.

A big loss for those interested in Japanese folklore, sadly.


Poster: PiRSquared Date: Jan 24, 2015 8:15pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

At least partially saved here: https://archive.today/www.obakemono.com

We can see it has been archived on the Wayback Machine as well by "exploiting" a minor bug (sorry Jeff), replacing the subdomain with "...": https://web.archive.org/web/*/...obakemono.com
Fortunately, we can access the content using this hack:
e.g. https://web.archive.org/web/20130527111513/http://...obakemono.com/obake/hoko/
Note that you'll need to change the "www" to ".." in the address bar for every link.
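
For anyone who would rather not retype every link by hand, here is a rough sketch of the rewrite described above. It only builds the URL string; whether the Wayback Machine still responds to such addresses is not something this code checks, and the function name is made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit

WAYBACK_PREFIX = "https://web.archive.org/web/"


def dotted_wayback_url(timestamp: str, original_url: str) -> str:
    """Build a Wayback URL with the leading "www" of the host changed to "..".

    `timestamp` is the 14-digit capture timestamp used in Wayback URLs.
    Hosts that do not start with "www." are left unchanged.
    """
    parts = urlsplit(original_url)
    host = parts.netloc
    if host.startswith("www."):
        # "www.example.com" -> "...example.com"
        host = ".." + host[len("www"):]
    rewritten = urlunsplit(
        (parts.scheme, host, parts.path, parts.query, parts.fragment)
    )
    return f"{WAYBACK_PREFIX}{timestamp}/{rewritten}"


if __name__ == "__main__":
    print(dotted_wayback_url("20130527111513", "http://www.obakemono.com/obake/hoko/"))
```

Run on the example above, this reproduces the address given in this post.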

This post was modified by PiRSquared17 on 2015-01-25 04:15:46


Poster: rin-q Date: Jan 27, 2015 7:06pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Interesting.

Domain squatters and such can already be such a pain when doing research on the Web... That the biggest Internet « archive » prevents access to already archived pages on the basis that the current domain owner has put up a no-crawler policy, without doing any kind of check-up, isn't exactly great.

Couldn't there be, at least, a check against current and past domain record data? While the anonymization (to a certain extent, at least) of such records is possible, it could help determine whether or not the robots.txt should be ignored.

If anything, I guess this trick can serve as a way to access those already crawled pages while the issue gets sorted out.

Thanks a bunch!

This post was modified by rin-q on 2015-01-28 03:06:38


Poster: PiRSquared Date: Jan 27, 2015 7:33pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Is whois data archived?


Poster: rin-q Date: Jan 28, 2015 10:03am
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Well. I am aware of at least one service that, while I haven't personally tried it, provides whois record history. Domaintools being that service, they claim to have archived whois records since 1995 and one can gain access to these for a monthly fee.

Now, I wouldn't know whether the Internet Archive has such records (I can only hope so), but another way to, at least partially, check whether or not to respect the robots.txt would be to first ignore it and do a sort of integrity check between the last archived content and the current content. If the content is too different, then the robots.txt file should be ignored for already archived content, but not for newer content.

Obviously, this probably wouldn't work in every case, but that'd still be a better way to go, if you asked me. Or the robots.txt file could simply prevent new crawls while still allowing visitor access to already crawled content.
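
To make the integrity-check idea a bit more concrete, here is a rough sketch using Python's standard difflib. The 0.5 threshold and the function names are arbitrary choices for illustration, and a real check would need something far more robust than comparing raw page text.

```python
from difflib import SequenceMatcher

# Hypothetical cut-off: below this, treat the live site as "a different site".
SIMILARITY_THRESHOLD = 0.5


def similarity(archived_text: str, current_text: str) -> float:
    """Return a similarity ratio between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(None, archived_text, current_text).ratio()


def robots_applies_to_old_snapshots(
    archived_text: str,
    current_text: str,
    threshold: float = SIMILARITY_THRESHOLD,
) -> bool:
    """Decide whether today's robots.txt should hide already archived pages.

    If the live content still resembles the last archived copy, assume the
    same site is behind it and honor its robots.txt for old snapshots too.
    If it is too different, ignore robots.txt for the old snapshots only.
    """
    return similarity(archived_text, current_text) >= threshold


if __name__ == "__main__":
    archived = "An illustrated encyclopedia of Japanese folklore monsters."
    parked = "This domain may be for sale. Sponsored listings below."
    print(similarity(archived, parked))
    print(robots_applies_to_old_snapshots(archived, archived))
```

New crawls would still respect the current robots.txt either way; the check only decides what happens to pages that were already archived.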

The current situation feels like a library making a book consultation-only, then erasing the past borrowers' memories of the book because of the new consultation-only policy. I mean, the data is still there (as you've shown me earlier), so why not just allow access to it?

This post was modified by rin-q on 2015-01-28 18:00:57

This post was modified by rin-q on 2015-01-28 18:03:35


Poster: #Danooxt3 Date: Mar 14, 2016 2:54pm
Forum: web Subject: Robots.txt

Many pages and data are inaccessible and lost forever, DUE TO ROBOTS.TXT

This message sucks.
I'm totally with you bro!


Poster: jory2 Date: Dec 28, 2011 6:46am
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

"The wayback machine is exempt from copyright issues under fair use doctrine and due to its educational purpose."

Did you somehow miss (or completely misunderstand) all the educational materials and websites made available on the subject of Copyright Law and what is considered a "fair-use" of copyright protected works?
You typed and spelled the words correctly, fair-use-doctrine, but did you bother to read the guidelines?

"Please stop ignoring websites because of ignorant, uninformed, or possessive webmasters."

That should go over well


Poster: Fizscy Date: Dec 28, 2011 7:42am
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

Yes I have read it. I'm a long-term contributor at wikipedia and I deal with Canadian and American copyright law on a regular basis because of that.

"The purpose and nature of the use.

If the copy is used for teaching at a non-profit institution, distributed without charge, and made by a teacher or students acting individually, then the copy is more likely to be considered as fair use."

The web archive is not a search engine crawler or similar robot, yet it seems to follow disallow requests for search crawlers just the same.

Second, the adding of that robots.txt has absolutely ZERO effect on the copyright and the fair use of the site. Nothing, nada, zip, zilch.

Third, domains change hands. Using a robots.txt file today to erase all previous copies on the archive is ridiculous, especially since the copies may be of a different site.

The archive should only exempt sites that have specifically requested, to archive.org by email, that their website not be indexed.


Poster: jory2 Date: Dec 28, 2011 8:39am
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

"Yes I have read it. I'm a long-term contributor at wikipedia and I deal with Canadian and American copyright law on a regular basis because of that."

Good, then I'll assume you're aware that "fair-use" is restricted to the U.S. Copyright Act and not the Canadian Copyright Act. And for what it's worth my field of study is Copyright and Intellectual Property Law.

"The purpose and nature of the use."

I'll assume you understood that to mean that not all Works can be argued under "fair-use"? Unless you applied your own special meaning to the fair-use clause of the U.S. Copyright Act(s)?

"If the copy is used for teaching at a non-profit institution, distributed without charge, and made by a teacher or students acting individually, then the copy is more likely to be considered as fair use."

I'll assume you're aware this website is privately owned and operated and receives private funds on top of government funds, and of course has the archive-it paid service. This website is considered a non-profit commercial website. It is not legally considered a Library and because of that will not be able to apply the limitations for Libraries as detailed in both the U.S and Canadian Copyright Acts.

"The web archive is not a search engine crawler or similar robot, yet it seems to follow disallow requests for search crawlers just the same."

What's your point?

"Second, the adding of that robots.txt has absolutely ZERO effect on the copyright and the fair use of the site. Nothing, nada, zip, zilch."

I'll assume you understood that content owners are not legally obligated to put a robots.txt file on their sites to prevent copyright violations.
Unless you have your own special meaning to that as well?

"Third, domains change hands. Using a robots.txt file today to erase all previous copies on the archive is ridiculous, especially since the copies may be of a different site."

This website is not simply copying the name of the domain; this website is making copies of the intellectual property on privately owned websites without the express permission of the rightful copyright owners.

"The archive should only exempt sites that have specifically requested, to archive.org by email, that their website not be indexed."

This website should only be making copies of websites that they received permission to copy in the first place.


Poster: Thestral Date: Apr 25, 2014 1:12pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

@jory2

You wrote...

"This website should only be making copies of websites that they received permission to copy in the first place."

You do realize that making copies is literally the way these here internets work don't you? You cannot view a website without making a copy of it, be it a permanent copy (as here) or a temporary one (as in your browser cache and temp files). If your notion were to be made reality there'd be no internet to archive as no one could "copy" web pages to view them.

Aside from that, you neatly avoided the main issue. The robots.txt policy here makes it possible for people with no rights over archives of certain intellectual property to literally wipe the last vestiges of said IP from the face of the web (just because they happen to have acquired a domain name that once belonged to the rightful IP holder).

That is a huge issue that needs a resolution which restores and protects these wrongfully removed archives while still allowing sites to nondestructively exempt themselves from archival going forward.



This post was modified by Thestral on 2014-04-25 20:12:02


Poster: Mr Cranky Date: Dec 28, 2011 11:28am
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

What is your take on the Authors Guild stance about robots.txt?


Poster: jory2 Date: Dec 28, 2011 12:13pm
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

I'm not familiar with the Authors Guild's stance or personal opinions on robots.txt files.
I am curious though: did you find it to be an interestingly humorous read, like the misunderstandings that play out in the forums on this website with respect to copyright, fair use, and the legal definitions of libraries?


Poster: jory2 Date: Dec 30, 2011 8:58am
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

I have been looking for the Authors Guild's stance on robots.txt files, but I haven't found much on it.

I did come across the Internet Archive's stance on robots.txt files, however.

Starting January 2010, Archive-It is running a pilot program to test a new feature that allows our partners to crawl and archive areas of sites that are blocked by a site's robots.txt file.

"Partners who have a need for this feature should contact the Archive-It team to let us know what sites and why you would like to use this feature. It would be helpful to know if you have previously contacted the site owner about allowing our crawler to crawl their sites, and what their response was (if any). We ask our partners to use this feature only when necessary. Also, please keep in mind that many things that are blocked by robots.txt are parts of a site that you wouldn't necessary want to archive, so please be sure to review the urls that are blocked in the 'Hosts Report' for your crawl to determine if you need this feature or not."

Oddly enough, this stance seems to be a complete 180 on this website's TOS.


Poster: Andy The Penguin Friend Date: Nov 5, 2014 11:22am
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

This is an agonizingly horrible situation.
An example of a (slightly) happy ending to this problem: the official Heart of Darkness site (Heartofdarkness.com) had been around since the mid-1990s. It went down sometime around 2004, but the squatters who took the domain didn't implement robots.txt to prohibit web crawling until far later (sometime after 2008).
I of course was devastated. I wanted to revive the site as a fan tribute but was unable to access the archived information through Wayback. However, I found who owned the domain by going through GoDaddy (I intended to buy it back before they said they'd only sell it for $5000. Yikes.) I explained that the robots.txt was making it impossible for me to view archived data from previous years, and that it broke my heart that I couldn't see it, and they altered the robots.txt so that I could see it again.
If Wayback has archived a site before the robots.txt was added, the addition of robots.txt DOES NOT delete the content; it just makes it inaccessible. I encourage people to contact the current domain owners about it. That guy was very nice to me and was happy to help.
Unfortunately, though, the HOD site was also on "AmazingStudio.com", and there's a whole new can of worms because some of the newer files were hosted on that domain instead of heartofdarkness.com. I can't figure out why files from there show up as "forbidden" through Wayback.

Anyways, it's a shame that robots.txt has to exist. What's its original intended use anyway?


Poster: user001 Date: Nov 6, 2014 7:24am
Forum: web Subject: Re: Why does the wayback machine pay attention to robots.txt

It's my understanding that the robots.txt file is meant to prevent search engines from crawling a website. Why the archive thinks it's a useful tool for site owners and the archive's website is a mystery. It makes sense for search engines but not for a site that's making permanent copies. It's even less useful for people who don't admin the website their IP is on. Maybe the people running this site thought it would be a way to avoid liability for copyright infringement?